Marion Coutarel, Data Analyst Track V2
Project 6
July 2021

Libraries

Datasets

Customers

Customers born in 2004 seem to be over-represented.

Products

Transactions

I remove the minutes and seconds from the date.

We have 194 rows beginning with "test". They all concern a highly suspicious product sold at -1 € and what seem to be the first 2 customers of the customers file. I assume these are just test runs and drop the rows.

I set Date to a datetime format, which is easier to work with.
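As an illustration of the test-row filter, here is a minimal plain-Python sketch. The field names (`id_prod`, `date`, `client_id`), the sample values, and the assumption that the "test" prefix sits in the date field are all hypothetical; the notebook's own code is not shown here.

```python
# Hypothetical sample rows: field names and values are made up for illustration.
transactions = [
    {"id_prod": "0_1483", "date": "2021-04-10 18:37:28", "client_id": "c_4450", "price": 4.99},
    {"id_prod": "T_0", "date": "test_2021-03-01 02:30:02", "client_id": "ct_0", "price": -1.0},
    {"id_prod": "2_226", "date": "2022-02-03 01:55:53", "client_id": "c_277", "price": 65.75},
]


def is_test_row(row):
    """Treat a row as a test run when its date string starts with 'test' (assumption)."""
    return row["date"].startswith("test")


test_rows = [row for row in transactions if is_test_row(row)]
clean_rows = [row for row in transactions if not is_test_row(row)]

# Inspect the suspicious rows before dropping them.
test_products = {row["id_prod"] for row in test_rows}
test_clients = {row["client_id"] for row in test_rows}
print(len(test_rows), "test row(s) dropped:", test_products, test_clients)
```

Looking at the distinct products and clients of the flagged rows first is what justifies dropping them as test runs rather than real sales.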
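The two date steps can be sketched in plain Python as follows. The timestamp format is an assumption, and the sketch parses first and truncates afterwards, which may not match the order used in the notebook.

```python
from datetime import datetime

# Hypothetical raw value; the real timestamp format is an assumption.
raw_date = "2021-04-10 18:37:28.723910"

# Parse the string into a datetime object.
parsed = datetime.strptime(raw_date, "%Y-%m-%d %H:%M:%S.%f")

# Drop minutes and seconds (and microseconds) so that only date and hour remain.
truncated = parsed.replace(minute=0, second=0, microsecond=0)

print(truncated.isoformat())
```

A real datetime value makes it possible to group by month or weekday later without string slicing.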

Merging datasets

products with transactions

22 product references are not in the transactions df.

1 reference is in transactions but not in the products list. It concerns 221 transactions. Those sales won't be analysed.

...then with customers

23 of our customers are not in the transactions dataset, but all of those listed in the transactions dataset are in the customers dataset.
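The reference checks made around both merges come down to set differences between key columns. The sketch below uses made-up identifiers and shows only the product side; the customer check works the same way with client identifiers.

```python
# Hypothetical identifiers, for illustration only.
product_ids = {"0_1", "0_2", "1_10", "2_5"}
transaction_rows = [
    {"id_prod": "0_1", "client_id": "c_1"},
    {"id_prod": "1_10", "client_id": "c_2"},
    {"id_prod": "0_9999", "client_id": "c_1"},  # reference missing from the products list
]
sold_ids = {row["id_prod"] for row in transaction_rows}

# Products that were never sold.
never_sold = product_ids - sold_ids

# References sold but absent from the products list, and the rows they concern.
unknown_products = sold_ids - product_ids
unmatched_rows = [row for row in transaction_rows if row["id_prod"] in unknown_products]

# Rows kept for the analysis: only those with a known product.
analysed_rows = [row for row in transaction_rows if row["id_prod"] in product_ids]

print(len(never_sold), len(unknown_products), len(unmatched_rows))
```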

Revenue Analysis

Sales over time

Dealing with missing values

There is a drop in sales in October 2021.

Some data are missing for cat 1 in October 2021.

For 25 days (from 2021-10-02 to 2021-10-27) there are no sales records for cat 1 products.

To deal with those missing values, we replace the NaN values with the mean to get an estimate of the actual October 2021 sales.
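A minimal sketch of mean imputation, in plain Python with `None` standing in for NaN. The figures are made up, and the choice of which mean to use (here, the mean of the observed daily values of the same series) is an assumption, since the text does not specify it.

```python
from statistics import mean

# Hypothetical daily cat 1 sales; None marks a day with no record.
daily_cat1_sales = [100.0, None, None, 140.0, 120.0]

# Mean of the observed values only (assumption: this is the mean used as a fill value).
observed = [value for value in daily_cat1_sales if value is not None]
fill_value = mean(observed)

# Replace each missing day with the fill value.
imputed = [fill_value if value is None else value for value in daily_cat1_sales]

estimated_total = sum(imputed)
print(fill_value, estimated_total)
```

Mean imputation keeps the estimated total plausible but flattens day-to-day variation, so the filled days should not be read as real observations.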

Sales per month

Sales decrease in February (-12% compared with January), which weighs on the upward sales trend (+5% over the first 11 months) and leads to an almost stable revenue from Year 1 (March 2021-February 2022) to Year 2 (March 2022-February 2023), around 6 M€ (+0.3%).

For a better view of how sales per category vary over time, we create an index with base 100 in March 2022.
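A base-100 index divides each month's value by the value of the reference month and multiplies by 100. The sketch below uses made-up monthly figures for two categories; the dictionary layout and the names are assumptions.

```python
# Hypothetical monthly sales per category (made-up figures).
monthly_sales = {
    "cat 0": {"2022-03": 200.0, "2022-04": 210.0, "2022-05": 190.0},
    "cat 1": {"2022-03": 400.0, "2022-04": 380.0, "2022-05": 440.0},
}
BASE_MONTH = "2022-03"


def to_base_100(series, base_month):
    """Rescale a {month: value} series so that the base month equals 100."""
    base_value = series[base_month]
    return {month: round(value / base_value * 100, 1) for month, value in series.items()}


indexed = {cat: to_base_100(series, BASE_MONTH) for cat, series in monthly_sales.items()}
print(indexed)
```

Because every category starts at 100, curves of very different magnitudes can be compared on a single chart.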

Sales amounts are steadier this last year than they were the year before. The decrease in sales in February is due to a decrease in the sales of all 3 categories, with Cat 2 showing better resilience.

Sales per cat per year

From a yearly point of view, revenue is stable from Year 1 (March 2021-February 2022) to Year 2 (March 2022-February 2023), around 6 M€ (+0.3%). The share of each category is quite stable too (slight increase for cat 2, slight decrease for cat 0).
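Since the years used here run from March to February, each sale has to be assigned to a year by counting months from March 2021. The sketch below shows one way to do that and to compute the year-over-year change; the helper name `analysis_year` and the amounts are hypothetical.

```python
from collections import defaultdict
from datetime import date


def analysis_year(day, start=date(2021, 3, 1)):
    """Return 1 for March 2021-February 2022, 2 for March 2022-February 2023, and so on."""
    months_since_start = (day.year - start.year) * 12 + (day.month - start.month)
    return months_since_start // 12 + 1


# Hypothetical sales (date, amount); made-up figures.
sales = [
    (date(2021, 3, 15), 400.0),
    (date(2022, 2, 10), 600.0),
    (date(2022, 3, 5), 500.0),
    (date(2023, 2, 20), 550.0),
]

revenue_per_year = defaultdict(float)
for day, amount in sales:
    revenue_per_year[analysis_year(day)] += amount

# Relative change from Year 1 to Year 2, in percent.
change_pct = round((revenue_per_year[2] - revenue_per_year[1]) / revenue_per_year[1] * 100, 1)
print(dict(revenue_per_year), change_pct)
```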

Sales per day of the week

Session evolution over time

Getting rid of double-date sessions

There is only one client per session, but there are 1587 sessions with 2 dates (0.5% of session dates, representing 0.7% of total sales). We need to get rid of those to go on with the session analysis.

Now that we have got rid of the sessions with a date conflict, we can create a new df named Session.
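One way to spot such sessions is to collect the distinct dates seen for each session and flag those with more than one. The sketch below is plain Python with made-up rows; the field names are hypothetical, and dropping the flagged sessions entirely is an assumption about how they are removed.

```python
from collections import defaultdict

# Hypothetical transaction rows (made-up values).
rows = [
    {"session_id": "s_1", "date": "2021-05-01", "price": 10.0},
    {"session_id": "s_1", "date": "2021-05-01", "price": 5.0},
    {"session_id": "s_2", "date": "2021-05-01", "price": 8.0},
    {"session_id": "s_2", "date": "2021-05-02", "price": 7.0},
]

# Distinct dates observed for each session.
dates_per_session = defaultdict(set)
for row in rows:
    dates_per_session[row["session_id"]].add(row["date"])

# Sessions spanning more than one date.
conflicting = {sid for sid, dates in dates_per_session.items() if len(dates) > 1}

# Share of total sales carried by the conflicting sessions.
conflicting_sales = sum(row["price"] for row in rows if row["session_id"] in conflicting)
share_of_sales = conflicting_sales / sum(row["price"] for row in rows)

# Keep only the sessions with a single date.
kept_rows = [row for row in rows if row["session_id"] not in conflicting]
print(conflicting, share_of_sales, len(kept_rows))
```

Measuring the share of sales first shows how much revenue the session analysis gives up by excluding these sessions.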

We create an age category so we can see how our different customer segments spend on our website.
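An age category can be derived from the birth year by computing an age and looking it up in a list of bounds. The bounds, labels, and reference year below are hypothetical; the notebook's actual bins are not shown in the text.

```python
from bisect import bisect_right

# Hypothetical bins: lower bounds of each category after the first one.
AGE_BOUNDS = [30, 40, 50, 60]
AGE_LABELS = ["<30", "30-39", "40-49", "50-59", "60+"]


def age_category(birth_year, reference_year=2023):
    """Map a birth year to an age category label (reference year is an assumption)."""
    age = reference_year - birth_year
    return AGE_LABELS[bisect_right(AGE_BOUNDS, age)]


for year in (2004, 1993, 1983, 1950):
    print(year, age_category(year))
```

Using `bisect_right` puts an age equal to a bound in the upper category, so a 40-year-old falls in "40-49".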

Sessions over time

The decrease that we observe in February is due to a lower number of sessions (-11.4%) and of unique clients (-4%), as the average session basket is remarkably stable this year.
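Monthly revenue is the number of sessions multiplied by the average session basket, so a fall in revenue with a stable basket points to fewer sessions. The sketch below computes the three indicators per month from made-up session records; the field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical session records (made-up values).
sessions = [
    {"month": "2023-01", "client_id": "c_1", "basket": 30.0},
    {"month": "2023-01", "client_id": "c_2", "basket": 40.0},
    {"month": "2023-01", "client_id": "c_1", "basket": 35.0},
    {"month": "2023-02", "client_id": "c_1", "basket": 34.0},
    {"month": "2023-02", "client_id": "c_3", "basket": 36.0},
]

by_month = defaultdict(list)
for session in sessions:
    by_month[session["month"]].append(session)

summary = {}
for month, items in by_month.items():
    revenue = sum(item["basket"] for item in items)
    summary[month] = {
        "sessions": len(items),
        "unique_clients": len({item["client_id"] for item in items}),
        "average_basket": revenue / len(items),
        "revenue": revenue,
    }

print(summary)
```

In this toy data the average basket is identical in both months, so the whole revenue drop comes from the lower session count.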

Session spending by age category

The younger the customer, the higher the customer's basket.
But customers under 40 represent less than a third of sessions.
We cannot see a significant difference between women and men in the purchase profile within age groups.